Hamiltonian MCMC and Fitting Bayesian Multilevel Regression

Presented by AG Schissler

04/29/2020

Start-of-class work: MCMC inefficiencies

Watch a Metropolis algorithm “get stuck” on the “donut” example. Discuss why this may be occurring.

mcmc demo
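The demo's behavior can be reproduced in a few lines of base R. Below is a minimal sketch (my own illustration, not the demo's code) of a random-walk Metropolis sampler on a ring-shaped ("donut") target: because the density is concentrated in a thin ring, most proposals land off the ring and are rejected, so the chain barely moves.

```r
## Random-walk Metropolis on a thin ring-shaped 2-D target (illustration only).
set.seed(1)
log_donut <- function(x) {
  ## log-density concentrated near the circle of radius 3, with sd 0.05
  r <- sqrt(sum(x^2))
  -(r - 3)^2 / (2 * 0.05^2)
}
metropolis <- function(n_iter, step) {
  draws <- matrix(NA_real_, n_iter, 2)
  x <- c(3, 0)                         # start on the ring
  accepted <- 0
  for (i in seq_len(n_iter)) {
    prop <- x + rnorm(2, 0, step)      # symmetric random-walk proposal
    if (log(runif(1)) < log_donut(prop) - log_donut(x)) {
      x <- prop
      accepted <- accepted + 1
    }
    draws[i, ] <- x
  }
  list(draws = draws, accept_rate = accepted / n_iter)
}
fit <- metropolis(5000, step = 1)
fit$accept_rate   # very low: most proposals fall off the thin ring
```

Shrinking `step` raises the acceptance rate but makes the walk around the ring painfully slow; this tradeoff is exactly the random-walk inefficiency that HMC is designed to avoid.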

I. Why HMC?

Overview of MCMC Strategies

Excellent reference text: Brooks, S., Gelman, A., Jones, G., & Meng, X. L. (Eds.). (2011). Handbook of Markov Chain Monte Carlo. CRC Press.

Problem with Gibbs sampling (GS)

Hamiltonian dynamics to the rescue

II. Brief overview of HMC

From MacKay, David JC. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

“The HMC method is a Metropolis method, applicable to continuous state spaces, that makes use of gradient information to reduce random walk behavior… It seems wasteful to use a simple random-walk Metropolis method when this gradient is available - the gradient indicates which direction one should go in to find states that have higher probability!”

See MacKay, Section 30.1 for a basic implementation of HMC and mathematical details. Let’s return to the MCMC demonstration and I’ll discuss the intuition.
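As a concrete companion to MacKay's Section 30.1, here is a minimal HMC sketch in R for a one-dimensional standard normal target (my choice of target for illustration, not MacKay's exact example): leapfrog integration of the Hamiltonian dynamics, followed by a Metropolis accept/reject step on the change in total energy.

```r
## Minimal one-parameter HMC: leapfrog steps + Metropolis correction.
## Target is a standard normal, so U(q) = q^2/2 and grad U(q) = q.
set.seed(1)
U      <- function(q) q^2 / 2    # potential energy = -log density
grad_U <- function(q) q          # its gradient
hmc_step <- function(q, eps = 0.1, L = 20) {
  p <- rnorm(1)                              # resample momentum
  q_new <- q
  p_new <- p - eps * grad_U(q_new) / 2       # half step for momentum
  for (l in seq_len(L)) {
    q_new <- q_new + eps * p_new             # full step for position
    if (l < L) p_new <- p_new - eps * grad_U(q_new)
  }
  p_new <- p_new - eps * grad_U(q_new) / 2   # final half step for momentum
  ## accept/reject on the change in total energy H = U + K
  dH <- (U(q) + p^2 / 2) - (U(q_new) + p_new^2 / 2)
  if (log(runif(1)) < dH) q_new else q
}
q <- 0
draws <- replicate(5000, q <<- hmc_step(q))
c(mean = mean(draws), sd = sd(draws))   # should land near 0 and 1
```

Note how the gradient steers each proposal toward higher-probability states, so acceptance stays high even though proposals travel far: the payoff MacKay describes.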

mcmc demo

More HMC readings and instructional materials

III. Using software to do HMC

Stan

Problem with regular HMC: U-turns

mcmc demo

Advantages of Stan

Quoted from:

Gelman, Andrew, Daniel Lee, and Jiqiang Guo. “Stan: A probabilistic programming language for Bayesian inference and optimization.” Journal of Educational and Behavioral Statistics 40.5 (2015): 530-543.

"Stan was motivated by the desire to solve problems that could not be solved in reasonable time (user programming time plus run time) using other packages.

In comparing Stan to other software options, we consider several criteria:

  1. Flexibility, that is, being able to fit the desired model.
  2. Ease of use; user programming time.
  3. Run time.
  4. Scalability as dataset and model grow larger."

A walkthrough of using rstan for multilevel modeling of the eight schools data

http://mc-stan.org/rstan/articles/rstan.html

A tour of the rstanarm package

Single-level regression for continuous outcomes

## list vignettes
vignette(package = 'rstanarm')
vignette(topic = 'continuous',  package = 'rstanarm')

Setting priors

vignette(topic = 'priors',  package = 'rstanarm')

Detailed use

vignette(topic = 'rstanarm',  package = 'rstanarm')

Partial pooling

vignette(topic = 'pooling',  package = 'rstanarm')

Code from Gelman and Hill textbook

You may find the following demonstrations useful when completing your HW.

demo(package = 'rstanarm')
demo(topic = 'ARM_Ch12_13',  package = 'rstanarm')

greta: high-performance Bayesian inference software

IV. Check the chains: diagnostics and a way to fix them

Sometimes it doesn’t work

Trace plot

Convergence diagnostics
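The Rhat column that `precis()` reports can be demystified by computing a basic Gelman-Rubin Rhat by hand. The sketch below uses the classic between/within-chain variance formula on simulated "chains" (rstan and rethinking actually report a refined split, rank-normalized version, so treat this as intuition, not their exact computation).

```r
## Basic Gelman-Rubin Rhat: compare between- and within-chain variance.
rhat <- function(chains) {       # chains: iterations x chains matrix
  n <- nrow(chains)
  chain_means <- colMeans(chains)
  B <- n * var(chain_means)                # between-chain variance
  W <- mean(apply(chains, 2, var))         # within-chain variance
  var_plus <- (n - 1) / n * W + B / n      # pooled variance estimate
  sqrt(var_plus / W)                       # ~1 when chains agree
}
set.seed(1)
mixed   <- cbind(rnorm(1000), rnorm(1000))        # chains exploring the same region
unmixed <- cbind(rnorm(1000, 0), rnorm(1000, 5))  # chains stuck in different regions
c(mixed = rhat(mixed), unmixed = rhat(unmixed))
```

When the chains disagree, the between-chain variance inflates the pooled estimate and Rhat rises well above 1; values much greater than about 1.01 signal trouble, as in the wild chain below.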

A wild chain

## install from github using devtools
## library(devtools); devtools::install_github("rmcelreath/rethinking")
## https://github.com/rmcelreath/rethinking
suppressMessages(library(rethinking))
y <- c(-1, 1)
set.seed(11)
m9.2 <- ulam(
    alist(
        y ~ dnorm( mu, sigma ),
        mu <- alpha,
        alpha ~ dnorm(0, 1000) ,
        sigma ~ dexp( 0.0001)
    ),
    data = list(y = y), chains = 2)

SAMPLING FOR MODEL '726d002e27cec1633082261fcfedb813' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 8e-06 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
Chain 1: Iteration: 100 / 1000 [ 10%]  (Warmup)
Chain 1: Iteration: 200 / 1000 [ 20%]  (Warmup)
Chain 1: Iteration: 300 / 1000 [ 30%]  (Warmup)
Chain 1: Iteration: 400 / 1000 [ 40%]  (Warmup)
Chain 1: Iteration: 500 / 1000 [ 50%]  (Warmup)
Chain 1: Iteration: 501 / 1000 [ 50%]  (Sampling)
Chain 1: Iteration: 600 / 1000 [ 60%]  (Sampling)
Chain 1: Iteration: 700 / 1000 [ 70%]  (Sampling)
Chain 1: Iteration: 800 / 1000 [ 80%]  (Sampling)
Chain 1: Iteration: 900 / 1000 [ 90%]  (Sampling)
Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.041044 seconds (Warm-up)
Chain 1:                0.030275 seconds (Sampling)
Chain 1:                0.071319 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL '726d002e27cec1633082261fcfedb813' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 3e-06 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.03 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:   1 / 1000 [  0%]  (Warmup)
Chain 2: Iteration: 100 / 1000 [ 10%]  (Warmup)
Chain 2: Iteration: 200 / 1000 [ 20%]  (Warmup)
Chain 2: Iteration: 300 / 1000 [ 30%]  (Warmup)
Chain 2: Iteration: 400 / 1000 [ 40%]  (Warmup)
Chain 2: Iteration: 500 / 1000 [ 50%]  (Warmup)
Chain 2: Iteration: 501 / 1000 [ 50%]  (Sampling)
Chain 2: Iteration: 600 / 1000 [ 60%]  (Sampling)
Chain 2: Iteration: 700 / 1000 [ 70%]  (Sampling)
Chain 2: Iteration: 800 / 1000 [ 80%]  (Sampling)
Chain 2: Iteration: 900 / 1000 [ 90%]  (Sampling)
Chain 2: Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.041869 seconds (Warm-up)
Chain 2:                0.048629 seconds (Sampling)
Chain 2:                0.090498 seconds (Total)
Chain 2: 
precis( m9.2 )
           mean        sd        5.5%     94.5%     n_eff     Rhat4
alpha -23.55871  373.4569 -676.254615  550.0968  71.69881 1.0206676
sigma 568.94576 1210.7862    8.826338 2309.5151 150.22651 0.9997524
traceplot( m9.2 )

A tame chain

set.seed(11)
m9.3 <- ulam(
    alist(
        y ~ dnorm( mu, sigma ),
        mu <- alpha,
        ## even include a "bad" starting point
        alpha ~ dnorm(1, 10) ,
        sigma ~ dexp( 1)
    ),
    data = list(y = y), chains = 2)

SAMPLING FOR MODEL 'db8b93ccfa83872ce482c35ebed2c618' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 1.2e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.12 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:   1 / 1000 [  0%]  (Warmup)
Chain 1: Iteration: 100 / 1000 [ 10%]  (Warmup)
Chain 1: Iteration: 200 / 1000 [ 20%]  (Warmup)
Chain 1: Iteration: 300 / 1000 [ 30%]  (Warmup)
Chain 1: Iteration: 400 / 1000 [ 40%]  (Warmup)
Chain 1: Iteration: 500 / 1000 [ 50%]  (Warmup)
Chain 1: Iteration: 501 / 1000 [ 50%]  (Sampling)
Chain 1: Iteration: 600 / 1000 [ 60%]  (Sampling)
Chain 1: Iteration: 700 / 1000 [ 70%]  (Sampling)
Chain 1: Iteration: 800 / 1000 [ 80%]  (Sampling)
Chain 1: Iteration: 900 / 1000 [ 90%]  (Sampling)
Chain 1: Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.012427 seconds (Warm-up)
Chain 1:                0.009018 seconds (Sampling)
Chain 1:                0.021445 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'db8b93ccfa83872ce482c35ebed2c618' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 3e-06 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.03 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:   1 / 1000 [  0%]  (Warmup)
Chain 2: Iteration: 100 / 1000 [ 10%]  (Warmup)
Chain 2: Iteration: 200 / 1000 [ 20%]  (Warmup)
Chain 2: Iteration: 300 / 1000 [ 30%]  (Warmup)
Chain 2: Iteration: 400 / 1000 [ 40%]  (Warmup)
Chain 2: Iteration: 500 / 1000 [ 50%]  (Warmup)
Chain 2: Iteration: 501 / 1000 [ 50%]  (Sampling)
Chain 2: Iteration: 600 / 1000 [ 60%]  (Sampling)
Chain 2: Iteration: 700 / 1000 [ 70%]  (Sampling)
Chain 2: Iteration: 800 / 1000 [ 80%]  (Sampling)
Chain 2: Iteration: 900 / 1000 [ 90%]  (Sampling)
Chain 2: Iteration: 1000 / 1000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.01224 seconds (Warm-up)
Chain 2:                0.010001 seconds (Sampling)
Chain 2:                0.022241 seconds (Total)
Chain 2: 
precis( m9.3 )
            mean        sd       5.5%    94.5%    n_eff    Rhat4
alpha 0.08552607 1.1880916 -1.6238445 2.105847 301.2610 1.003915
sigma 1.57960741 0.8229531  0.6827578 3.088258 247.0879 1.005284
traceplot( m9.3 )

“Folk Theorem” of statistical computing

Prof Andrew Gelman says to heed it

V. Visualization in Bayesian model development

This is a good workflow to employ:

Gabry et al Visualization in Bayesian workflow

The four basic steps are

  1. Specify a posterior distribution (by specifying the priors and likelihood).
  2. Draw from the posterior distribution. You should generate fake data to ensure this is working properly.
  3. Criticize the model. You should compare predictions to the data to assess the quality of the model.
  4. Analyze manipulations of predictors and parameters. Use the posterior samples to conduct statistical inference: credible intervals, posterior predictions, Bayesian hypothesis tests, etc.
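Step 2's fake-data check can be sketched concretely for a simple single-level regression: simulate data from known parameter values, refit, and confirm the fit recovers them. Here `lm()` stands in for the Bayesian fit so the example is self-contained; in practice you would use, e.g., `rstanarm::stan_glm()` with the same formula.

```r
## Fake-data check: simulate from known parameters, refit, compare.
set.seed(1)
n <- 500
true_alpha <- 1; true_beta <- 2; true_sigma <- 0.5
x <- rnorm(n)
y <- true_alpha + true_beta * x + rnorm(n, 0, true_sigma)
fit <- lm(y ~ x)          # stand-in for the Bayesian fit
coef(fit)                 # estimates should land near alpha = 1, beta = 2
```

If the refit cannot recover parameters you chose yourself, there is no reason to trust it on real data; this is the cheapest diagnostic in the workflow.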

Closing

In what situations do we need to use more advanced and efficient MC algorithms?